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Abstract 

Low rank tensor learning, such as tensor completion and multilinear multitask learning, has received much attention in recent 
years. In this paper, we propose higher order matching pursuit for low rank tensor learning problems with a convex or a nonconvex 
cost function, which is a generalization of the matching pursuit type methods. At each iteration, the main cost of the proposed 
methods is only to compute a rank-one tensor, which can be done efficiently, making the proposed methods scalable to large 
scale problems. Moreover, storing the resulting rank-one tensors is of low storage requirement, which can help to break the curse 
of dimensionality. The linear convergence rate of the proposed methods is established in various circumstances. Along with the 
main methods, we also provide a method of low computational complexity for approximately computing the rank-one tensors, 
with provable approximation ratio, which helps to improve the efficiency of the main methods and to analyze the convergence 
rate. Experimental results on synthetic as well as real datasets verify the efficiency and effectiveness of the proposed methods. 
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I. Introduction 

Tensors, appearing as the higher order generalization of vectors and matrices, make it possible to represent data that have 
intrinsically many dimensions, and give a better understanding of the relationship behind the information from a higher order 
perspective. In many machine learning problems such as tensor completion 0 ~ 0 - multilinear multitask learning (MLMTL) 
0-0 and tensor regression |:8), one often aims at learning a tensor that has low rankness. For example, in tensor completion, 
the goal is to learn a low rank tensor provided that only partial observations are available. In the context of MLMTL, to 
allow for common information shared between tasks to pursuit better generalization, by learning several tasks simultaneously, 
where each task is indexed by more than two indices, all the tasks can be represented by a tensor assumed to lie in a low 
dimensional spaces. In tensor regression, to better understand the information behind high dimensionality data, the weight 
vector is represented by a low rank tensor. These applications give rise to low rank tensor learning. 

Commonly speaking, to learn a low rank tensor, tensor learning minimizes a real-valued cost function F : T —► R subject 
to some constraints or with regularizations to encourage the low rank property of the learned tensor. Here T := R n i x - xn « 
denotes an TV-th order tensor space, and F(-) is a continuous function. A widely used regularization is the sum of modc-ci 
matrix nuclear norms 0-0, 0, which encourages the tensor to have low Tucker rank 00 - Some variations 

of the nuclear norm, such as the Schatten-p norm |3|, ||4), the latent norm and the scaled norm |6|, as well as other variations 
03 , 00 , have been studied. The advantage of the above approaches are their convexity, which enables them to be solved 
by many existing algorithms, while a main drawback is that all the approaches rely on solving singular value decompositions 
(SVD), which lacks scalability. Another category of approaches decomposes the tensor into several factors, and applies the 
alternating minimization rule to solve the resulting optimization problems, see, e.g., 0, 0, 0. This type of approaches 
avoids computing SVDs, but lacks global convergence analysis. Recently, tensor nuclear norm based algorithms are proposed 
in 0), 0|. This type of algorithms solves a simple and efficient subproblem at each iteration, which is scalable to large-scale 
problems. However they are specifically designed for convex F(-), and the stepsizes have some restrictions. 

In this paper, motivated by the simple and efficient matching pursuit (MP) methods for sparse approximation 03 - 03 - 
signal recovery f20) , matrix compoletion 0 and greedy method for tensor approximation |22) , we propose higher order 
matching pursuit (HoMP) methods for solving low rank tensor learning problems, either with a convex or nonconvex cost 
function. At each iteration, the classical MP selects an “atom” from a given dictionary, and then updates the new trial based 
on the linear combination of the current trial and the selected atom, with suitably chosen weights, whereas in tensor learning 
setting, the atom is a rank-one tensor, which has to be learned based on the gradient information of F(-). Finding such a rank- 
one tensor reduces to computing a tensor spectral norm, or known as the tensor singular value problem |23| , [[24) . Although 
solving such a subproblem exactly is NP-hard in general [25), approximation methods exist, and fortunately, as MP, HoMP 
allows to use an approximation solution. This feature makes HoMP particularly suitable for tensor learning. When choosing 
the weights, if the cost function F(-) is associated with a least squares loss, then three strategies, which are in accordance with 
matching pursuit 0, economic orthogonal MP ED (or relaxed MP, see, e.g., {26) ) and orthogonal matching pursuit 0), 
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can be applied. We then generalize HoMP to the case that F(-) is nonconvex, where the weights are chosen by minimizing 
a quadratic function which majorizes F(-). Along with the main HoMP methods, an efficient algorithm will be presented to 
approximately and efficiently solve the tensor singular value problem mentioned above, with a provable approximation ratio. 
This ratio is important in analyzing the convergence rate, which will be used extensively in Sect. IV Besides the efficiency, 
another advantage of HoMP-type methods is its low storage requirement, as will be explained in Sect. m 

The convergence rate of HoMP is analyzed in two specific problems: tensor completion and MLMTL. Specifically, we first 
show that, if F(-) is associated with the least squares loss, then HoMP converges linearly for the two specific problems. We 
then generalize our analysis to a class of loss functions, which may be nonconvex and includes many robust losses as special 
cases. Interestingly, the linear convergence rate can still be established in these cases. 

In a nutshell, our contribution is summarized as follows: 


1) We propose efficient HoMP methods for tensor learning, which are applicable for problems with with convex or nonconvex 
cost functions; 

2) We present an efficient method for selecting rank-one tensors, with provable approximation ratio. The ratio is important 
in analyzing the convergence rate. 

3) We establish linear convergence of the proposed HoMP methods, either for problems with convex or nonconvex cost 
functions. 


The rest of this paper is organized as follows. Sect. I-A gives preliminaries on tensors. Low rank tensor learning problems 
are formulated in Sect. [0] and specified by tensor completion and MLMTL. The HoMP-type methods, their related work, 
and an efficient algorithm for selecting rank-one tensors will be detailed in Sect. Ill Sect. IV is focused on analyzing the 
convergence rate. Numerical experiments will be conducted in Sect. [V] Finally, Sect. VI draws some conclusions. 


A. Preliminaries on tensors 

Vectors are written as (a, b , ...), matrices correspond to ( A , B,...), and tensors are written as (A, B, ■ ■ ■). T := R"i x?l 2 x---xnjv 
denotes an IV-th order tensor space. 

For two tensors A,B £ T, their inner product is given by (A, B) = A 1 ---ijrBi 1 ...i JV ■ The Frobenius norm 

of A is defined by ||A||f = ( A , A) 1 / 2 . The outer product of vectors Xj £ i = 1,..., N is denoted as xj ® • ■ • ® xjv and 
is a rank-one tensor in T defined by (xi (8) • • ■ (8) x.v),;., ■ ■■i N = JlyLi x ■ The mode-d tensor-matrix multiplication of a tensor 
X £ T with a matrix U £ R JdXnd is a tensor of size n\ X • • • x rid -i x Jd x rid +i x • • • X tin- 

1) Tensor-matrix mode-d unfolding and mode-(p,q) unfolding: The mode-d unfolding of tensor A is denoted as A.(d) by 
arranging the mode-d fibers to be the columns of the resulting matrix. For an IV-th order tensor A, the mode-(p, q) unfolding 
is to choose the (p, q) modes and merge them into the first mode (row) of the unfolding matrix, and the remaining N — p— q 
modes are merged into the second mode (column). The unfolding is denoted as At pq . N \ { Ptq \] and can be simplified as At pq) if 
necessary, where the semicolon specifies a new mode. We explain it by an example. For a 4-th order tensor A, its mode-(l, 2) 
unfolding is A [ 1 j 2 ; 3 , 4 ], defined by ( A [li2;3i4 ])( il _ 1)fl2+i2i ( i3 _ 1 ) n4+i4 = 

2) Tensor rank: There are mainly two types of tensor rank, namely the CP-rank and the Tucker-rank 0- The CP-rank is 
defined as the minimum integer R such that for a tensor X, it can be factorized as a sum of R rank-one tensors, rankcp(-) 
will be used to denote the CP-rank in this paper. The Tucker-rank of an IV-th order tensor IT is an A tuple, whose z-th entry 
is the rank of the unfolding matrix Xuy 

3) Tensor singular value problem: For a tensor X £ T, its largest singular value is defined as (23) 


max||y|| F= i jran k CP (y;)=i (T",3I) > 


(1) 


which is equivalent to the tensor spectral norm 
problem is NP-hard in general 


|IT|| 2 and is dual to the tensor spectral norm, see, e.g., 1241. Solving such a 


II. Low Rank Tensor Learning Formulation 

As introduced in the introduction, low rank tensor learning seeks a low rank tensor solution via minimizing a cost function. 
Mathematically, the model considered in this work is of the following general form 

minyy.F(W) s.t. rankcp(W) < K, (2) 

where F : T —> ffi. denotes the cost function which is continuously differentiable, either convex or nonconvex. The low CP-rank 
constraint encourages the learned tensor to have low rank structure. We specify the model 0 via two specific applications: 
tensor completion and MLMTL. 
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A. Tensor completion 

The goal of tensor completion is to infer a tensor (possibly low rank) from its partial observations. Mathematically, given a 
partially observed tensor Bo where V. denotes the index set of observed entries, the problem can be formulated as 

minyy eT r(W) s.t. Wn = Bn, 

where r(-) is used to control the low rankness of the tensor, such as rankcp(-). and the sum of nuclear norms. In our setting, 
letting F(W) := E(i 1 ,...,i* r ) 6 n £(Wi 1 ...i N — with a specific loss £(■), we model the problem as follows 

min rankcp (w)<K F(W), 

with K being a positive integer to control the CP-rank of VV. 

B. Multilinear multitask learning (MLMTL) 

MLMTL learns many tasks simultaneously, where each task is indexed by more than two indices An example is to 

predict consumers’ ratings for restaurants, where each rating contains several aspects. Then each task is indexed by consumer 
and aspect. Since each task can be represented by a weight vector, all the tasks jointly yield a third-order tensor, see, e.g., |5j. 
In the following, we restrict ourselves to multilinear multitask regression. Specifically, We consider T tasks, each of which is 
specified by a weight vector w* € R /; that corresponds to a linear function (x, w ( ), where x is an observed input. Provided 
that the associated observed output is y, we employ a specific loss £((x, w 4 ) — y). For each task w 4 , a finite set of training 
samples {(x 4 , t/ 4 )}™\ is available, and we aim at minimizing the empirical risk F( W) defined as 

F(W) := V TO t _1 Z] mt i £ (( x i’ w ‘) -2/i)> 

ZJ t —1 ^ -' 2=1 

where W = [w * 1 ,..., w T ] £ M. DxT is the weight matrix. For each task f, we assume that it is related to N > 2 indices, each of 
which varies from 1 to rii, i = 1, ..., N. That is, the task w 4 can be identified by the indices (*i,. .., ijy) £ [ni] x • • • x [njv]- 
In this case, we have T = Ilili n i■ Correspondingly, the matrix W can be folded to an (N + l)-th order tensor VV, with size 
D x n\ x • • • x tin, and W can be regarded as the mode-1 unfolding of W. We also denote F(W) = F( W). Assuming that 
the tasks share certain common structure, our model is defined as 
m in rankcP (w)<K F(W), 

Therefore, both of our models of tensor completion and MLMTL adopts the CP-rank to control the low rankness of the 
learned tensor, which is quite different from models based on nuclear norm regularizations Q-0. 0. 

III. Higher Order Matching Pursuit 

Having presenting our low rank tensor learning problem 0, the goal of this section is to introduce the Higher order Matching 
Pursuit (HoMP) methods to solve it. HoMP methods are presented in Algorithm [T] 


Algorithm 1 Higher order Matching Pursuit (HoMP) for low rank tensor learning 

Input: W (0) = 0; a 0 = 0; K > 1. 

Output: the resulting tensor VV' K V 

for k = 1 to I< do 

• Select a normalized rank-one tensor <S' A: V 

(VF(W (fc) ),S (fe) ) >/3 max (VF(W (fc) ), S) (0 < p < 1) (3) 

|5||p = l,rank C p(5) = l 


• Update 

1) yvH fc+1 ) =W (k) +aS (k \ a = argmin a F{W {k) + aS^) 

2) yv ,(fe+1) = + 02<S( fe ), (a£,a^) = argmin( aija2 ) F{a\W^ k ' > + a 2 S^) 

3) W (fe+1) =ELo s *‘ S( * ) ’ « = («o> ••• I afc) T = argmin aeRls+ i F(Et=o Q 'i' ?(i) ) 

4) yv ,(fe+1) = W (fe) +aSW, a = -(VF(W (fc) ),5^)/i (HoMP-G) 

end for 


(HoMP-LS) 

(HoRMP-LS) 

(HoOMP-LS) 

{L is a Lipschitz constant) 


We describe the method in more details. Given a cost function F(-) of low rank tensor learning, with initial guess W’*- 0 ) being 
the zero tensor, at each iteration, HoMP can be divided into two steps: the selection step and the updating step. The selection 
step chooses certain atom, which is a rank-one tensor by solving the tensor singular value problem 0 approximately 
with an approximation ratio /J, where X = V F{}V lkl ), as shown in 0. The approximation ratio is important in convergence 
rate analysis, as will be shown in Sect. [TV] The updating step adaptively computes the weights and updates the new trial. This 
step has two cases: 

1) If F(-) is associated with a least squares loss, and F(-) with respect to weights a is quadratic, then three strategies can be 
considered: higher order MP with least squares loss (HoMP-LS), higher order relaxed MP with least squares loss (HoRMP-LS) 
and higher order orthogonal MP with least squares loss (HoOMP-LS), as shown in Algorithm [I] where all the weights can be 
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computed by solving certain least squares problems. These strategies are respectively in accordance with MP (IT) , economic 
orthogonal MP [|2TJ (or relaxed MP, see, e.g., (26) ) and orthogonal MP ]T8) , and generalize them to tensor learning setting. 

2) If F(-) is associated with a general loss which is possibly nonconvex, while the gradient Vf’(') is Lipschitz continuous 
with Lipschitz constant L, then the strategy higher order MP with a general loss (HoMP-G) can be applied. In fact, the weight 
is computed by minimizing a quadratic function that majorizes F(-) at W’( fc + 1 ). 

If Algorithm [T] stops within K iterations, then it generates a feasible solution to (j2}. Comparing between HoMP, HoRMP 
and HoOMP, one can see that HoOMP may obtain a better new trial, as it updates the weights most greedily, while HoMP 
updates the weights least greedily. However, computing the weights for HoOMP may be time consuming, as it may require to 
solve a linear equations system. HoRMP, which considers the linear combination between the current trial and the new atom, 
can be regarded as a trade-off between HoMP and HoOMP. 

The main computational cost of HoMP-type methods is the selection step 0- Comparing with those methods that require 
to solve SVD, solving ([T|i will be more efficient, as will be detailed in subsection |III-B| 

Another advantage of HoMP-type methods is the low storage requirement. Suppose we work in a tensor space T and the 
methods stop within K iterations with I\ not too large; since the learned tensor is a combination of some rank-one tensors, 
and each rank-one tensor can be represented by the outer product of N vectors, the whole tensor can be stored by using 
71 ■' ' K storage only, against using Jlili 71 ■> to storc the whole tensor. This can help to break the curse of dimensionality. 

A. Related work 

As mentioned earlier, HoMP-type methods are motivated by MP-type methods for sparse approximation G3-Q5J, signal 
recovery and matrix completion (21) . The classical MPs iteratively select atoms from a redundant dictionary, one at a 
time, and then use their certain linear combination to approximate a given signal. Recently, ED generalizes MPs to matrix 
completion, where the cost function is least squares based, and proves the linear convergence of their methods. Our work 
generalizes ED in the following senses: 1) we generalize MPs to tensor learning problems; 2) our methods can be adapted 
to problems with a general loss; 3) the way of selecting atoms in tensor setting is more challenge than in matrix cases; 4) we 
establish linear convergence rate for a wide range of loss functions, which may possibly be nonconvex. 

Another issue related to HoMPs is the approach of successive rank-one approximation to tensors (SR1A), see, EZJ Sect. 
3.3] and the references therein. In general, consider a linear system A(W) ~ b where A is a linear operator, b is a vector and 
W is a low rank tensor to be determined. SR1A updates the new trial as yy( fe+1 ) = + S^ where is a rank-one 

tensor which minimizes some convex cost function E(A(X) — b), see, e.g., [281, [ [29) . Comparing with SR1A, HoMPs are 
more general in terms of the problems to be solved, the cost function F(-) to be minimized, and the strategies of choosing 
the weights. Moreover, finding the rank-one term S in SR1A is either intractable or needs an alternating minimization strategy 
[1221 without explicit approximation ratio, while we allow an approximation solution <[3j. 

HoMPs are also closely related to conditional gradient (CG) methods for tensor learning © m- CG, also known as 
the Frank-Wolfe method (30), is a classical method for constrained convex optimization problems, and regains attention in 
recent years, see, e.g., pTT At each iteration, CG also computes an atom. Then, different from MPs, CG performs a convex 
combination of the selected atom and the current trial, which restricts the new trial to still lie in the convex constraint. This 
difference leads to significant differences to the models behind MPs and CG, where MPs minimize the cost function constrained 
by a “hard” constraint, while CG minimizes the cost function constrained by its “soft” counterpart, such as L 0 norm versus 
L \ norm, matrix/tensor rank versus matrix/tensor nuclear norm. 


B. Efficiently computing the selection step 0 

Since solving (D exactly is NP-hard in general (25), the goal of this subsection is to present a method to find an approximation 
solution of 0 with low computational complexity, and to explicitly derive the approximation ratio. In the literature, several 
approximation methods have been proposed, e.g., the power-type methods (32)—(34), the Newton method (35), and others 


-|401. However, these methods are not very efficient in our setting. 

For a 2d-th order tensor A, our method is defined recursively as follows, where the output is a set of 2 d normalized vectors, 
whose outer product yields the approximation solution. 
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Subroutine 1: (xi,..., x 2 ( z) = ApproxSpectral2d(,4) 

1) If the order is 2, return the singular vector pair (x 1; x 2 ) 
corresponding to the leading singular value of A; 

2) Compute the (inexact) singular vector pair 

(x[ 12 ],X [3 2 d]) corresponding to the leading singular 

value of matrix Ar li 2 ; 3,...,2d]- 

3) Fold the vector xn 2 i to matrix X[ 1;2 ] and compute the 
leading singular vector pair xm and x™; 

4) Denote the (2d—2)-th order tensor y := Ax ixj x 2 xj; 

compute (x 3 ,..., x 2d ) = ApproxSpectral2d(>). 

5) Return (xi,... ,x 2d ). 

At each recursion of Subroutine [T] the dominant cost is Step 2), i.e., to compute the leading singular value of a matrix of size 
n 2 x n ( 2d - 2fc ) at the fc-th recursion, with 1 < k < (d— 1). Although the computational complexity of only computing the leading 
singular value is less than doing the full SVD, when the size goes high, this procedure still takes much time. Empirically we 
find that there is no need to compute the singular value precisely, and running a few power iterations is acceptable. That is, we 
can compute Step 2) inexactly. Suppose a few power iterations have been performed in Step 2), and we obtain a normalized 
vector pair (x [1)2 ], x [ 3 ,...,2<z]) such that A (lj 2 ) x [3i ... )2d ] = «(d) IIA(i, 2) || 2 X[i, 2 ] where 0 < a (d) < 1, and A ( i j2) is short for 
A-, 2 ; 3 __ 2 d]■ Then we can establish the following lower bound. For simplicity, we may assume « i = n 2 = • ■ ■ = n 2 d = n. 

Proposition 1: Let the order of the tensor be 2d. Suppose at the A:-th recursion, the vectors obtained in Step 2) are normalized 
and satisfy 

A(l j 2) x [3,...,2d-2fc+2] = a (fc)l|A(l, 2 )|| 2 X[ ll2 ], (4) 


where 0 < c < a^k) < 1, 1 < fc < d — 1 and c is a constant. Then there holds 


( A , xj . <8 ■ • • ® x 2d ) > 


IIfc=l a (/c)l|A(i ;2 )||2 

^max{0,3d/2—2} 


> YliJt a (/c)Mll 2 

^ rnax {0,3cZ/2— 2 } 


(5) 


Proof: We denote v = (A, xi <8 ■ ■ ■ < 8 > x 2 ( z), tensor A4i := A x i x^ x 2 xj and matrix M 2 := (A, ■ ® • (g> Ap.... ; 2 d])■ Here 
A[ 3 ;... ; 2 d] is a (2d — 2 )-th order tensor folded by the vector X [ 3 2( ;] generated in Step 2 ) of Subroutine |T| either exactly or 
inexactly, and the (z, j)-th entry of matrix M 2 is given by the inner product of A(i,j ,and A[ 3; ... ; 2 d] ■ F°r ease of 
notation, we also denote M 2 the mode-(l,2) unfolding of Ad\. We use the induction method on d. When d = 1, Q holds. 
Suppose ([5j holds when d = l > 2. When d = l + 1, there holds 


v = 


> 


> 


> 


> 


(A1i,x 3 0 ■••0x 2(i+1) ) 

^ fc=2 (from Step 4) and the induction) 

nt^wiiMiik _ ni=l<* w (MuX) 

n 3/2l-2 . n - n 3/2l-l 


ULl A[3;... ;2(Z + 1)]) 

n 3/2l-l 

rifc= 2 a (fc)( M 2 : X l ® X 2 ) 

n 3/2l-X 

IIfc=i a ( fe)|!A ( i j 2 )|| 2 (x ] r X [ 1 . 2 ]X 2 ) 

n 3/2l-l 

IIfc=i a (fc)||A ( i ;2 )||2 IIfc=i «(fc)||A|| 2 

n 3/2l-l/2 ~ n 3/2(Z+l)-2 > 


(from 0 ) 


where the second and the last inequalities follows from the relationship between matrix spectral norm and Frobenius norm. 
Therefore, ([5]) has been proved. ■ 

After applying Subroutine |T| we can perform the following block coordinate updating subroutine [32j to further improve 
the solution quality. That is to say, we can get a larger value (A, X 4 ,..., x 2( ;) than (A, xi <8 • • ■ <g> x 2 ( z) obtain in ([5]). 

Subroutine 2: (xi,..., x 2 ( z) = BCU(A, x 2 ,..., x 2 ( z) 
for i = 1 ,..., d 

Compute the singular vector pair (x 2 j_i, x 2 j) corresponding 
to the leading singular value of the matrix Ax 1 x 1 x 2 - - ■ x 2) _ 2 

x 2i-2 X 2 i + i X 2i+ i X 2i+2 '■ ‘ X 2 d X 2 d- 

end for 
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Of course. Subroutine [2] can be applied several times after Subroutine [I] to get a better solution. However, considering the 
trade-off between the computational cost and the solution quality, performing it a few times is enough to get an acceptable 
solution. 

We discuss the computational complexity. At the fc-th recursion of Subroutine [I] the complexity is ) + 0{n 2 )\ 

Subroutine |2]is 0(n 2d ) + (){n 2 ). Thus the total complexity of Subroutine [T] together with Subroutine [ 2 ] is Y^t=i 0(n 21 ), which 
is slightly worse than the power method for computing the leading singular value of a matrix of the same size of the tensor 
considered here. Therefore, comparing with methods based on SVD, e.g., Q, HoMPs may be more efficient: for TV-th 
order tensor space, at each iteration, methods based on SVD require to perform TV SVD, with complexity at least 0(Nn N+1 ), 
which is higher than ours. 

Finally, we note that tensors whose order is odd can always be treated as a tensor of even order, e.g., a tensor A £ R ni X " 2X " 3 
can be seen as in the space R 1XTl ixn2xn,3. 


IV. Convergence rate analysis 

In this section, we will establish the linear convergence rate of Algorithm |T| for tensor completion and MLMTL, either with 
a convex cost function with least squares loss, or with a possibly nonconvex cost function. Tensors in this section are assumed 
to be of order TV, W £ T. A key property for the analysis is inequality 0- To fit into the language of Algorithm [I] we write 
it as 

(VF(W^),S^) > p||VF(W (fe) ) (li2) || 2 , (6) 

where 0 < p < 1 is a ratio depending on the size of the tensor, as that derived in d5l, and VF(yV ( - fc - > )n o\ is the mode-fl, 2) 
unfolding of VF(W«). 

This section is organized as follows: in subsection |IV-A| we prove the linear convergence for tensor completion and MLMTL 
when F(-) is associated with a least squares loss; in subsection IV-B we extend the linear convergence results to F(-) with a 
possibly nonconvex loss. 


A. Least squares 

1) Tensor completion: In the least squares sense, the cost function of tensor completion can be written as 

F{W) = l/2||W n -Bn||F. 

The following theorem will establish the linear convergence rate of HoMP-LS, HoRMP-LS and HoOMP-LS uniformly in terms 
of the objective value F{W). 

Theorem 1 (Linear convergence rate for tensor completion): Denote VJ' k ' > := Wq ' 1 — Bn, and let S^ be generated by 
Subroutine 0 with input VT(W (k) ), and (0 holds. If {>V ( <} is generated by HoMP-LS, HoRMP-LS and HoOMP-LS, then 
there holds 


F(yv (k+1) ) < (1 - ) F(W {fc) ). 

V nin 2 J 


Proof: We first consider HoOMP-LS. For clarity we denote the weights a. at the k -th step as or = (a^,... ,a^). For 


convenience we denote ||Aq||p’ = ||7F||n. By the definition of 7and Y\J^ k+1 \ there holds 

k 

2F(W (fe+1) ) = \\TZ {k+1) \\ 2 F = \\W( k+ V-B\\ 2 n =mm\\Ya i SW- J 


lln 


i=0 


k -1 


< min || a k ~ 1 S^ + aS^ — B\\ 


i =0 


= min ||W (fe) - B + aS^Wl = min ||72.( fe ) + cuS( fe )|| 


n- 


For HoRMP-LS, we have 


ll^ (fc+1) ll! 


min + a 2 S^ — B\\ 

(o;i,Q2) 


< min ||W (fe) -B + aS^Wl = min ||7^(fc) +a 5(fc)||2_ 


It follows that for HoMP-LS, it naturally holds ||7?. ( - fe+1 ^|||, = minQ \\VS kS> + ||q. Therefore, we can uniformly analyze 

the upper bound in terms of minQ, \[R,^ + cuS^Hq. We have 

<11-—) \\n^\\ 2 F , 


mm||^ fc ) +aSW\\ 2 n = \\U^\\ 2 F ~ H^Hf - P^^Wl 


nin 2 ) 


where the first inequality follows from ( 7 \!/ fc ), = (VF(W ( ' k ^),S < ' k ^) > p\\ VF(W( fc ))( 12 )||| = p||R^ |||, 

inequality is due to the relationship between the spectral norm and the Frobenius norm. 


and the last 
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2) Multilinear multitask learning: In the least squares sense, the cost function of multilinear multitask learning is given by 

T m t 

F(W) = l/2^m t - 1 E((x‘ I w t ) -y‘) 2 - 

t=l i = 1 

Letting X* G W ntXD be formed by stacking the samples (transposed to rows) corresponding to the t- th task row by row, i.e., 
(X t ) T = [xj,... , x^J, we derive a compact form of F 

T 

F(W) = l/ 2 ^m t - 1 ||X‘w t -y‘|| 2 F , 

t= 1 

and the corresponding gradient, which is represented by the mode-1 unfolding of VF(W), can be written as 

VF(W)d) = [(X 1 ) T (X 1 w 1 - y^/mr,..., (X T ) T (X T w T - y T )/m T \ . (7) 


We have the following results. 

Theorem 2 (Linear convergence rate for MLMTL): Assume that the matrices X 4 (X f ) 1 are all positive definite, 1 < t < T, 
whose smallest eigenvalues are uniformly lower bounded by A m i n , while the largest eigenvalues are uniformly upper bounded 
by A max . Let be generated by Subroutine [l] with input VF(W^), and (|6]i holds. Let m max := maxi <t<Ttnt- If {W^} 
is generated by HoMP-LS, HoRMP-LS or HoOMP-LS, then there holds 


F(W {k+1) ) < (1- An in m^ x \ F ^k)y 

\ ^'1^'2'^max 1 / 


Proof: We denote r 4, ( fc ) := X 4 w 4, ( fe ) — y 4 , 1 < t < T. We first consider HoOMP-LS. Similar to the previous analysis, 
we have 


T T 

F(W {k+1) ) = V —||r 4 ’ (fc+1) ||i = V — ||X 4 w‘-( fc+1 ) -y 

t=i 2wt Gt 2mt 


t II2 

F 


= i! xt (E^' (i) )-y : 

t =1 z i =o 


t II2 
F 


1 


k -1 


t M 2 
F 


< miny^-||X*(V a!f + as*’^) — y 

t= 1 1 i=0 

T T 

= min V -—||X*w‘>C fc ^ - y * + aXV’^H! = min V -— ||r t >f fe > + aX 4 s 
a ' 2m+ ct 2 rrif 

t =l 1 t =l 1 

We then consider HoRMP-LS as follows 

T T 

^(W (fe+1) ) = ^ 2 ^ r l|r 4 ’( fc+1 )||^ = y]^ r l!X i w t '( fc + 1 )-y 4 || 2 F 

t=i z 4 t=i z 4 


, 4 >( fc )l||,. 


1 ^ ^ 

= min -i|X 4 (aiw 

Om, 11 v 


(ai,o 2 ) 2 mt 


4, ( fc ) + a 2 s 4, ( fe )) — y 4 |||. 


T T 

< min V —||X 4 w 4 ’W -y 4 +aX 4 s 4 ’W|| 2 = min V-||r 4 ’^ + aX 4 s 4 ’W||i. 

« 2wt a ' 2mt 

t=i E t=i 1 

And it naturally holds F(W^ fc+1 ^) = min Q Y^t=i 2 mfll rt + <xX 4 s 4, ( fe ) |||> for HoMP-LS. We then analyze the upper bound 


of min a EtLi 2 rh _ ll rt ’^ fe ^ + «X 4 s 4 ’( fc )|||.. From the optimality condition 


V — (x 4 s 4 >W) T (r 4 ’W + dX t s i, ^ ) ) = 0 


we get 


a = —- 


Ef =1 ^<X 4 s 4 >( fc ),r 4 >W> 

EL 1 m 4 - 1 ||X 4 s 4 .W||| 





and so 


1 „ tm„o ^ 1 n (ELi TO t 1 ( X ‘ st,(fc) >r t ’ (fe ))) 2 


^( W (fe+!)) < min V"-||r‘'( fc ) + aXV’Wlll = V-||r t -< fc >|||. - 

rv ^' 2?7 ^ j ^^ 9m ^ 


2m t 


aE^imrlxv.wni 


( 8 ) 


Recalling the definition of r t, ( k '> and V F(W^), the numerator of the second term is exactly (V-F(VV^), S^} 2 . Using the 
inequalities 

(VF(W (k) ),S (k) ) > p||VF(VV w ) (1)2) ||^ > —1|VF(W (fe ^)||j 


niri2 


\l , 


and from the assumption that ||X t s t A fe )|||, < A ma x||s t ’^|||- and ||(X t ) T r‘A fe )|||, > A m j n ||r t ’( ft )|||i ) the second term of <[8]i can 
be lower bounded as follows 


(ELi^xv-wy-w)) 2 

2 Et=t m t" 1 ||X t s t, ( fc ) Up 


> 


> 


> 


|VF(WW)|| 


2nm 2 E i =r TO t _1 l! xtst ’ (fc) llF 


pA n 


E t =i' 


7- 2 llrf( fc )||2„ 


2nin 2 A max Et=i m t 1 ||s t A fc )| 


pA m i n ?7Z 


-1 

max 


tun 2 A max Et=r m t 1 fzt 2m * 


—y—i 


r .t,(fe)||2 


If- 


This together with ([8) yields 

F(W (k+1) ) < f 1 - P^minTTlra. ax 


tiitt 2 A max i 


-)E^rii r ‘'“ ) ii5’ = I 1 - ,> (#'), 

- I ■ ■ /TO * ' rnn 2 A max 2^ t=1 m t J 


as desired. ■ 

In real world applications, however, the assumption that every matrix X*(X 4 ) 1 is positive definite may not hold, e.g., when 
the sample size mt is larger than the size of the feature space D. To establish the linear convergence in this setting, we consider 
the following two cases: 

1) (X*) T X* is positive semidefinite; we add the L 2 regularization to the cost function, i.e., now the new cost function is 


F(W)= J F(W) + A^^ r ||w t ||^ 


t —i 


where A > 0 is a regularization parameter. In this case, we rewrite F as 


F{W) 


G(W) + G 



where X* e M. DxD is the square root of (X t ) T X t + XI and is a symmetric matrix, z* = (X t ) _1 (X t ) T y t , and C := 
Er=r(2 m *) _1 (l|y t II f — ||z*||^,) denotes the constant term. Now X^X*) 1 ^ is positive definite, which meets the assumption of 
Theorem [2] 

2) (X t ) T X t is positive definite; we simply set A = 0 above, and also obtain that X*(X 4 ) T is positive definite. 

Therefore, we have linear convergence rate in the sense of G(-). 

Corollary 1 ( Linear convergence for MLMTL without the positive definiteness assumption): Assume that W G T. Let Ff), 
G(-) and X f be defined as above. Assume that the smallest and the largest eigenvalues of matrices (X*) 2 are respectively 
lower bounded by A mill and upper bounded by A max . Let 0 < p < 1 be defined as in <0, and let m max := maxi < t < T m t . If 
{Vy( fc )} is generated by HoMP-LS, HoRMP-LS or HoOMP-LS with the cost function given by Ff), then there holds 

G(yy( fe+1 )) < ( 1- Anm^mLc j £(^(*0). 

y nin 2 A max i rrit J 


B. A class of loss functions 

In this subsection, we conduct the convergence rate analysis for a class of loss functions which may be nonconvex. To this 
end, we first present some assumptions that characterize such class of functions £{t): 

Assumption 1: 1) £(t) > 0, f(0) = 0, £{t) = £(—t), £{t) < t 2 / 2, Vi € M, and £(t) is coercive; 

2) l(t) is continuously differentiable; 

3) |/(s) — /(f)| < |s — t\, Vs,t G R; 
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4) denote ijj(t) := £ (i)/f; then 0 < ip(t) < 1, and 0) exists and is finite; 

5) ip(t) 0 if t -ft oo. 

The above assumptions are not restrictive, as they are met for a variety of loss functions, e.g., the Huber’s loss 07), the 
L\ — L 2 loss 2(y/l + 1 2 /2 — 1), the Fair loss cr 2 (|f|/cr — log(l + \t\/&)), and the Cauchy loss <t 2 /2log(l + t 2 /a 2 ). Here a 
is a parameter. Note that the Cauchy loss is nonconvex. We can also define the generalized Huber’s functions that satisfies 
Assumption IT) 

m / i2 /2 1*1 < 6 

( ) ' \ 5 2 ~P(\t\P/p + Sp/ 2 - SP/p ) \t\ > S , 

where <5 > 0 is a parameter and 0 < p < 2. When p = 1 it reduces to the Huber’s loss; when p = 2 it is exactly the least 
squares loss. If p < 1 then this class of functions is also nonconvex. Fig. [T| plots the losses mentioned above. 



(a) Different loss functions. 





(b) Generalized Huber’s loss functions with different p values, with <5 = 1. 
Fig. 1: Different loss functions. 


We still begin our analysis from the tensor completion setting. 

1) Tensor completion with a general loss: With the loss function (it) satisfying Assumption [I] the cost function of tensor 
completion is given by 

F(W)= Y, mv '« .',)• (9) 

(zi 

That is to say, we penalize the tensors entry by entry with £(•). Accordingly, the gradient of F(-) at VV can be derived as 

VF(W) = $ S] o(W-B) f! , (10) 





























10 


where o denotes the Hadamard operator, i.e., entry-wise product, and T is defined as T,;, ..., v = 'tp('IZ, t ... lN ) where we denote 
7Z := W — B, and 'ip is defined in Assumption [l] In robust statistics p-T) , with an appropriate loss such as those mentioned 
below Assumption |T[ each entry of T can be seen as a weight assigned to the corresponding entry of 7 Z. As the entry of 7Z 
goes large, the corresponding entry of T will decrease to control the influence of the large deviations. 

If the cost function of tensor completion is given by |9|, then one can use Assumption [l]-3) to verify that 

||VF(W)- VF(V)H f < ||W-V||f, (11) 


i.e., the gradient of F(-) is Lipschitz continuous with constant being 1. Therefore, we can apply strategy 4) of Algorithm [T| 
HoMP-G to solve it. In the following, we present our results and analysis for tensor completion solved by HoMP-G. 

Theorem 3 (Linear convergence for tensor completion with a general loss): Assume that the cost function is given by ([9]) 
with a loss function £(■) satisfying Assumption 111 Let p be defined as in ([6]). If {VV ,/l - ) } is generated by HoMP-G, then there 
holds 

F(W {k+1) ) < (l - -??—') F(W (fc) ), 

V nin 2 ) 


where 0 < q < 1 is some constant. 

Proof: Let Yp( k+1 '> = W ^ + aS^ where a = — (VF(}V l ' k ' > ) 1 S f ' k ' > ). According to ( pT| ), we have 

F(w {k+1) ) < F(w (fe) ) + (VF(w (fc) ),w (/c+1) -w {k) )+ 2~ 1 \\w { - k+1) -w {k) \\ 2 F 

= F(W {k) ) - 2- 1 (VF(W (fe) ), S (fc) ) 2 

< F(W (fe) )--—1| VF(W (fe) )||p. 

2n\ri2 

Therefore, { F(yV <k:> )} is nonincreasing. From the definition of F(-) and Assumption [lj-1), it follows that {>Vf * } is uniformly 
bounded. We consider the term ||VF(yV’( fe ))||^. From ( flO) , it follows 

I|VF(WW)|| 2 = ^ V’(< fe) .., ijv ) 2 (< fc , ) .. lijv ) 2 , 

where 7 Z^ = Wq ' — Bn. From Assumption |Tj-5), the boundedness of and Bn implies that i/j(7Z\ k ^ iN ) is uniformly 

lower bounded away from zero for all k > 0 and for all (zi,...,ijv) £ 9. Without loss of generality we assume that 
'tp{TZ\ k) ijv ) > q > 0, where the magnitude of q only depends on the magnitude of Bn- On the other hand. Assumption [lj-1) 
tells us that (7 Z^ iw ) 2 > 2£(piZ^ ijv ). As a consequence. 


||VF(W (fe+1) )|||. > 2 q 2 F{W (k) ), 


and so 


F ( W (fc+ 1) ) <(i - F(W (fe) ), 

V nin 2 ) 


as desired. 

2) Mutilinear multitask learning with a general loss: In this setting, the cost function is given by 

T m t 

t=1 i=1 


( 12 ) 


That is, the noise or outliers are penalized sample-wisely by using £(■). The gradient of F(-) at W can be derived as follows: 
its gradient at w f is given by 

(X^A'fXV-y 


where X 4 is the same as that defined in subsection 


IV-A2 


and A £ 


is a diagonal matrix, whose i-th diagonal entry 


A\i = — Vi)- Therefore, the gradient of Ff) at W in terms of its mode-1 unfolding can be written as 


VF(W) (1) = (X 1 ) t A 1 (X 1 w 1 - y 1 )/mi, (X T ) T A T (X T w T - y T )/m T 


(13) 


The difference between ( fl3] l and its least squares counterpart ([7]i is the matrix A 4 , which acts as a weight matrix to remove 
the large deviation between X 4 w 4 and y* if necessary. Using Assumption [l]-3), one can verify that its gradient is Lipschitz 
continuous. 


||VF(W)- VF(V)||f < 



c||W-V||jr, 


(14) 


where we use A max = maxi < t <T ||X* ||§ as a Lipschitz constant. Thus HoMP-G can also be applied. We present our convergence 
rate results in the following. 
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Theorem 4 (Linear convergence for multilinear multitask learning with a general loss): Assume that the cost function is 
given by © with a loss function £(■) satisfying Assumption [I] Assume that the matrices X^X 4 ) 1 are all positive definite, 
1 < t < T, whose smallest eigenvalues are uniformly lower bounded by A m j n , while the largest eigenvalues are uniformly 
upper bounded by A max . Let p be defined as in and let TO max := maxi<t<T mt. If {yy( fc )} is generated by HoMP-G, 
then there holds 

F(W < - k+1 '>) < (l - P^ mi “ m maxg \ F(W < ' k) ), 

\ niri2A max J 


where 0 < q < 1 is a some constant. 

Proof: We denote r t ^ k 1 := X*w t ’^ — y*, 1 < t < T. According to the strategy MP-non-LS in 
yy(fc+i) _ yp(fc) -|-a<S^, where 

5=-(VF(#'>),5 w )/A max . 


Algorithm [l] 


we let 


Using ( fl4| ) and similar to the proof of Theorem [3] we have 

F(w (fc+1) ) < F(w (fe) ) + (vF(w (fc) ),yv ;(fc+1) - w (fc) ) + ^y^||>v (fc+1) -yv (fe) ||| 

= F(W (fe) )-(VF(W (fc) ),«S (fc) ) 2 

< F(W (fe) )-- - — x -||VF(W (fc) )|||.. 

27li?7.2 A max 

It follows that {F(W (fc) } is a nonincreasing sequence. This together with the definition of F(-) and Assumption[l]-l) implies that 
all the sequence {r*’( fc )} are uniformly bounded, 1 < t < T. Recalling Assumption |T]-5) and recalling A A = i/)((x-,w f ) — yj), 
the boundedness of {r*A fe )} also implies that there is a universal constant q > 0 such that A- > q for 1 < i < m t , 1 <t <T, 
and k > 0, where the magnitude of q only depends on the magnitude of the samples {x ■. if}. Now we can consider the term 
||VF(W W )||!,. From ([13), it follows 


||VF(W (fc >)|||. = 


> 


> 


^m t - 2 ||(X t ) T A t (X t w‘-( fc )-y i )||| 

t= 1 

T 

Amin m~l x Yl TO t~ 1 |l A*(X*W t ’^^ - y*)|||’ 

t =1 

T 

t =1 


T m t 

> A min m-^q 2 ^ m t _1 ^ £((x*, w‘) - y\) 

t= 1 2=1 

= 2A min m-^q 2 F(wM), 

where the first inequality is due to ||X* ||| > A m i n , and the last inequality follows from Assumption |T]-1), and so 

F(W (fc+1) ) < (l - ~\ F(W {k) ). 

V tlltt2A max J 


The proof is completed. 

To study the case that the matrices X*(X*) T 
cost function as that discussed in subsection 
semidefmite. 


IV-A2 


may not be all positive definite, we also make some modifications to the 

is positive 

and reconstruct the cost function as 


We let (X*) 2 := (X‘) T X t + XL where A > 0 if (X t ) T X‘ 


and A = 0 if (X*) T X t is positive definite. We denote z* = (X*) 1 (X i ) T y* 


T m t 

Gm = Y2 m i 1 J2 e ^ w 4 ) - 4 ), G5) 

t=l 2 = 1 

where x* g M D is the i- th column of X*. Similar to Corollary [I] we have 

Corollary 2 (Linear convergence for multilinear multitask learning with a general loss and without the positive definiteness assumptioi 
Assume that the cost function is given by © with a loss function £(•) satisfying Assumption [T| Let G(-) and X* be defined 
as above. Assume that the smallest and the largest eigenvalues of matrices (X 4 ) 2 are respectively lower bounded by A m i n and 
upper bounded by A max . Let p be defined as in and let m max := maxi <t<Ttrit. If {W^} is generated by HoMP-G, 
then there holds 

G(W (fc+1) ) < (1 - P Amhim maxg~ \ Q(yy(k)y 

V tt'l r l2A max J 
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Remark 1: Before ending this section, we remark that 1) inequality ([3]) is very important in deriving the linear convergence. 
Without it, only sublinear convergence rate can be obtained; 2) all the convergence rates obtained in this section can be 
improved in theory by using, e.g., the methods in 1381, 1391 to improve the ratio p in while getting less efficiency. 


V. Numerical Experiments 

In this section, we present some numerical experiments on synthetic data as well as real data, focusing on the applications: 
(robust) tensor completion and MLMTL. All the numerical computations are conducted on an Intel i7-3770 CPU desktop 
computer with 16 GB of RAM. The supporting software is MATLAB R2013a. 


A. Tensor completion 

Our HoMP-type methods with least squares loss are compared with 4 state-of-the-art methods: generalized conditional 
gradient (GCG) ]15| , HaLRTC (T), factor priors (FP) 0 and TMac (42) . GCG also solves an approximate tensor singular 

value problem Qat each iteration; HaLRTC solves a convex relaxation of tensor completion with sum of matrix nuclear 

norms; FP uses some prior knowledges of the tensors, and TMac is based on tensor factorization. The stopping criterion for 
HoMPS is when the residual TZ <k ^ used in Sect. IV is less than a threshold. For all the methods except HaLRTC, the threshold 
of the stopping criterion is e = 10~ 5 ; for HaLRTC, we set 10~ 6 because otherwise it cannot generate a good result. The max 
iteration for all the methods is 500, whereas for HoMPs, the max iteration K is the only parameter, which is tuned by 10-fold 
cross validation over K £ {100,200,300,400,500}. All the results are averaged over ten instances. 

1) Synthetic data: Third order tensors of size 200 x 200 x 200 are randomly generated, with CP-rank 10. Some entries are 
randomly missing, with missing ratio (MR) varies in {0.5,0.6,0.7,0.8,0.9,0.95,0.99}. The relative error \\X* — 'S||f/||£>||f 
and the computational time (in Second) are respectively reported in Fig. [2] and Table [I] The results in Fig. [2] show that HoMPs 
and GCG have better performances, particularly when the MR value is very high. The results in the subfigure of Fig. [2] show 

that HoMPs perform better than GCG when the MR values in [0.5, 0.9]. Table [I] shows that HoMPs are more efficient than 

other methods, particularly when the MR value is very high. That is because we have optimized our codes by utilizing the 
sparsity of the data and using the sparsity manipulation in Matlab. Comparing between the three HoMPs, it is interesting to see 
that the relative error of HoOMP is not as good as the other two. That may be because HoOMP learns the tensor more greedily 
and leads to overfitting. And HoOMP is the slowest among the three methods, as it has to solve a larger linear equation system 
to obtain the weights. 



Fig. 2: Tensor completion results on synthetic data (200 x 200 x 200, CP-rank = 10) in terms of relative error. 


TABLE I: Efficiency comparison of different methods on tensor completion on synthetic data (200 x 200 x 200, CP-rank = 10). 


MR (%) 

HoMP 

HoRMP 

HoOMP 

G S 

HaLRTC 

s 

TMac 

P 

50 

16.09 

20.95 

45.02 

126.14 

42.89 

88.18 

23.00 

60 

13.85 

17.77 

37.38 

106.60 

54.38 

97.64 

26.56 

70 

11.41 

14.27 

28.29 

85.72 

72.21 

121.94 

32.75 

80 

8.63 

10.66 

19.84 

64.13 

98.76 

149.37 

42.78 

90 

5.25 

6.28 

10.79 

38.89 

97.03 

100.28 

49.41 

95 

3.45 

3.94 

6.05 

24.69 

96.72 

71.72 

46.44 

99 

1.98 

2.05 

2.35 

14.41 

1.98 

54.65 

46.06 
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TABLE II: Comparisons of different methods on tensor completion real datasets 




HoMP 

HoRMP 

HoOMP 

GCG 

mi 

HaLRTC 

□ 

FP 

IE 

TMac 

E3 

Datasets 

MR (%) 

Relerr 

Time (S.) 

Relerr 

Time (S.) 

Relerr 

Time (S.) 

Relerr 

Time (S.) 

Relerr 

Time (S.) 

Relerr 

Time (S.) 

Relerr 

Time (S.) 


70 

8.10E-02 

1.17 

8.16E-02 

1.50 

8.05E-02 

2.64 

7.85E-02 

6.47 

6.93E-02 

32.09 

5.47E-02 

195.12 

2.45E-01 

61.62 

Lena 

80 

1.06E-01 

0.92 

1.07E-01 

1.08 

1.05E-01 

1.78 

9.70E-02 

4.67 

9.59E-02 

44.14 

8.3 IE-02 

199.84 

4.01E-01 

57.88 

(512x512x3) 

90 

1.56E-01 

0.31 

1.56E-01 

0.36 

1.58E-01 

0.42 

1.42E-01 

2.28 

1.71E-01 

68.94 

1.34E-01 

64.42 

7.15E-01 

45.83 

95 

2.17E-01 

0.24 

2.19E-01 

0.27 

2.44E-01 

0.32 

2.02E-01 

1.66 

3.77E-01 

75.53 

5.92E-01 

32.04 

9.09E-01 

40.40 


99 

4.10E-01 

0.21 

4.38E-01 

0.20 

4.90E-01 

0.22 

4.00E-01 

1.05 

8.61E-01 

72.94 

9.75E-01 

14.03 

1.01E+00 

18.11 


70 

1.17E-02 

43.37 

1.18E-02 

51.36 

1.17E-02 

259.93 

3.41E-02 

47.62 

3.00E-02 

35.33 

3.67E-02 

442.16 

1.81E-02 

244.31 

Spectral 

80 

1.45E-02 

34.76 

1.45E-02 

40.13 

1.44E-02 

175.54 

3.68E-02 

35.24 

4.24E-02 

35.19 

6.65E-02 

444.89 

2.65E-02 

227.77 

(205X246X96) 

90 

2.24E-02 

24.40 

2.23E-02 

26.87 

2.20E-02 

90.14 

3.90E-02 

21.97 

6.46E-02 

56.31 

6.74E-02 

210.08 

6.31E-02 

208.89 

95 

3.74E-02 

7.34 

3.80E-02 

7.54 

3.68E-02 

11.83 

4.38E-02 

14.78 

1.05E-01 

132.35 

1.74E-01 

91.23 

1.71E-01 

198.56 


99 

8.74E-02 

2.25 

8.97E-02 

2.35 

9.40E-02 

2.47 

8.28E-02 

9.59 

8.50E-01 

129.95 

9.56E-01 

33.66 

6.94E-01 

183.16 


70 

1.55E-02 

42.00 

1.54E-02 

51.41 

1.53E-02 

331.32 

5.98E-02 

65.01 

3.22E-05 

133.70 

1.30E-03 

86.05 

5.81E-02 

286.55 

MRI 

80 

2.11E-02 

33.20 

2.14E-02 

39.97 

2.10E-02 

177.95 

6.54E-02 

47.25 

2.89E-03 

148.18 

2.04E-03 

116.64 

9.68E-02 

322.64 

(181X217X181) 

90 

3.96E-02 

21.44 

3.97E-02 

24.47 

3.83E-02 

89.69 

7.32E-02 

27.37 

9.03E-02 

158.40 

4.20E-03 

150.70 

2.26E-01 

309.16 

95 

8.35E-02 

7.13 

8.30E-02 

7.84 

7.95E-02 

15.20 

8.59E-02 

16.90 

3.44E-01 

186.62 

2.69E-01 

120.94 

4.47E-01 

308.42 


99 

3.18E-01 

2.03 

3.33E-01 

2.24 

3.46E-01 

2.51 

3.30E-01 

9.17 

9.95E-01 

2.77 

9.54E-01 

52.91 

9.08E-01 

273.60 


70 

1.12E-01 

90.95 

1.14E-01 

116.51 

1.88E-01 

57.08 

1.79E-01 

328.28 

1.14E-01 

288.78 

7.35E-02 

697.61 

1.89E-01 

530.81 

Knixl 

80 

1.26E-01 

64.18 

1.28E-01 

80.70 

2.03E-01 

41.70 

1.92E-01 

281.14 

1.60E-01 

359.90 

7.85E-02 

716.28 

2.37E-01 

904.77 

(512X512X3X22) 

90 

1.62E-01 

37.69 

1.62E-01 

44.26 

2.23E-01 

23.14 

2.09E-01 

211.19 

2.70E-01 

381.45 

9.57E-02 

714.48 

3.25E-01 

1237.13 


95 

2.21E-01 

9.27 

2.20E-01 

10.51 

2.16E-01 

38.80 

2.24E-01 

173.88 

4.36E-01 

383.07 

6.05E-01 

438.44 

4.02E-01 

1208.53 


99 

4.22E-01 

2.26 

4.14E-01 

2.02 

3.94E-01 

5.92 

3.65E-01 

134.75 

9.27E-01 

356.27 

9.86E-01 

138.46 

8.26E-01 

1098.88 


70 

1.34E-01 

94.85 

1.35E-01 

121.01 

1.93E-01 

63.89 

1.85E-01 

337.07 

1.50E-01 

396.27 

8.05E-02 

726.74 

1.50E-01 

562.98 

Knix2 

80 

1.52E-01 

66.01 

1.54E-01 

83.02 

2.08E-01 

41.80 

2.00E-01 

276.68 

2.04E-01 

390.28 

8.47E-02 

732.19 

1.92E-01 

918.55 

(512X512X3X24) 

90 

1.99E-01 

38.34 

1.99E-01 

45.50 

2.28E-01 

23.64 

2.18E-01 

208.10 

2.98E-01 

391.50 

1.19E-01 

732.63 

2.68E-01 

1278.58 

95 

2.53E-01 

9.57 

2.53E-01 

10.74 

2.42E-01 

38.86 

2.46E-01 

167.50 

3.89E-01 

385.79 

6.10E-01 

441.75 

3.86E-01 

1230.64 


99 

3.85E-01 

2.21 

3.78E-01 

2.00 

3.80E-01 

5.76 

3.52E-01 

131.51 

9.95E-01 

7.80 

9.85E-01 

148.51 

1.12E+00 

1197.74 


70 

7.11E-02 

179.43 

7.20E-02 

211.71 

1.07E-01 

45.52 

9.98E-02 

360.14 

7.42E-02 

327.47 

7.56E-02 

1191.23 

7.75E-02 

1688.45 

Tomato 

80 

7.74E-02 

133.91 

7.77E-02 

153.38 

1.13E-01 

29.72 

1.07E-01 

264.58 

9.63E-02 

436.98 

7.98E-02 

1203.06 

8.33E-02 

1597.08 

(242X320X3X167) 

90 

8.51E-02 

81.57 

8.54E-02 

87.78 

1.21E-01 

16.21 

1.14E-01 

158.06 

1.35E-01 

646.06 

9.51E-02 

1212.66 

8.90E-02 

1496.16 

95 

9.55E-02 

48.74 

9.55E-02 

50.38 

1.13E-01 

25.52 

1.17E-01 

98.14 

2.44E-01 

634.48 

4.32E-01 

825.22 

1.93E-01 

1435.33 


99 

1.40E-01 

14.01 

1.40E-01 

12.29 

1.43E-01 

28.90 

1.30E-01 

41.69 

8.81E-01 

634.68 

9.75E-01 

293.10 

7.41E-01 

1389.03 


2) Real data: Several real data sets have been chosen: the Knix datasets, the Tomato video, the Hyperspectral images, the 
brain MRI and a color image Lernf] Knix consists of two tensors of order-4, Tomato is also a tensor of order-4, and the 
remaining datases are 3rd order tensors. Since the order of magnitude of tensors in Knix and Tomato is 10 7 , for these datasets 
we set the max iteration of FP as 50, because it is too time-consuming at each iteration. For the same datasets we also restrict 
the max iteration of HoOMP by 100 because at each iteration it has to solve a relatively large linear equation system. For 
HaLRTC on Tomato, the max iteration is set to 100 due to efficiency. The results are reported in Table [II] where we can 
observe that in most cases, HoMPs have acceptable or better performances, but need less time. HoMPs perform particularly 
well on the Hyperspectral images: even the MR value is 99%, the relative error is less than 0.1, which maybe due to the fact 
that the tensor in this dataset is indeed low rank. On Tomato, which is of size (){ 10 7 ), our methods take much less time than 
competing methods and obtain better performances. On Knix and Lenna, FP outperforms other methods when the MR value 
is not high, which maybe because it uses some prior knowledges, however, mining the knowledges might be time-consuming, 
making it slower than other methods. When recovering the color image, HoMPs have a 30x speedup comparing with HaLRTC, 
and lOOx speedup comparing with FP, which may be useful in real world applications. In Fig. [3] results recovered by different 
methods from one slide of the Tomato dataset with 0.95 missing ratio are illustrated to intuitively evaluate the performances. 
From the results we can observe that HoMP and HoRMP recover more details than other methods. 



Fig. 3: This example intuitively shows slides recovered by different methods from the 37-th slide of the Tomato dataset with MR= 0.95 


B. Multilinear multitask learning 

Our HoMP-type methods with least squares loss are compared with 4 state-of-the-art methods: sum of (overlapped) nuclear 
norm (Overlapped for short) @•0 , latent nuclear norm (Latent) j6j , scaled latent nuclear norm (Scaled) j6||, and a nonconvex 
approach (Nonconvex) 0 - The first three methods are based on nuclear norm regularized convex optimization, while the last 
method factorizes the tensor into factors, which is nonconvex. We also note here that the difference between the Overlapped 
method in |5| and |6| is that |5j uses the sum of nuclear norm as a regularization, while |6j treats it as a constraint. In the 
following we consider them as the same method. The stopping criterion is the same as the previous experiment. Parameters 
are tuned via 10-fold cross validation. Specifically, I\ £ {15, 20, 25,..., 50}. The following datasets are chosen: 

1) School dataset. This datasets is made available by the Inner London Education Authority (ILEA), which consists of 
examination records from 139 schools in years 1985, 1986 and 1987, with 15362 students. Each task is to predict exam scores 
for students in each school, where the size of the input space is 25, with one indicating the bias term. Following |[6j, we model 
it as a MLMTL problem, where each task has two indices: the school index and the year index. Therefore, the tasks jointly 

1 Knix can be downloaded from http://www.osirix-viewer.com/datasets/ and Tomato, Hyperspectral images and brain MRI are available at https://code. 
google.com/p/tensor-related-code/source/browse/trunk/Model/Tensor+Completion/LRTC_Package_Ji/?r=6 
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gives a 25 x 3 x 139 weight tensor to be learned. The size of the training set varies from 2000 to 12000. Similar to |6), we 
use the explained variance 100 • (1 — MSE tes t/var(y)) to evaluate the performance of the compared methods. 

2) Restaurant & consumer dataset [43 This dataset consists of rating scores from 138 consumers to different restaurant 
from 3 aspects, with 3483 instances. Each task is to predict the rating given by a consumer from one aspect, provided a 
restaurant as an input. The size of the input space is 45, with one indicating the bias term. Since each task can be indexed by 
two indices: the consumer index and the aspect index, the tasks jointly yields a 45 x 3 x 138 weight tensor to be learned. The 
size of the training set varies from 400 to 2000. The test MSE is used to evaluate the performance of the compared methods. 

The results averaged over 20 instances on the school and the restaurant datasets are respectively reported in Fig. 4a and 
[4b] From Fig. [4a] we can observe that HoRMP and HoOMP perform comparable with or better than other methods on the 
school dataset, while HoMP does not perform well. Scaled |6] performs best among Fatent [16], Scaled (6| and Overlapped 
[5J, (6), which is in accordance with the observations in The method Nonconvex |5j is slightly worse than HoRMP and 
HoOMP when the size of the sample is small, and then it catches up our methods as the samples increase. For the restaurant 
dataset, HoMP is still worse than other methods, see Fig. [4b] HoRMP and HoOMP are not as good as other methods when 
the sample size is small, which may be due to the overfitting in these cases. However, they eventually catch up other methods 
when the samples increase, and are slightly better when the sample size is larger than 1600. To give a clearer view on their 
performances, we also use the bar plot to show the MSE of the compared methods in Fig. 4c The efficiency comparisons are 
reported in Table [III] from which one can again see the efficiency of HoMP-type methods. 

During the experiments, we also observe that for our methods, around K = 25 iterations can give desirable results, which 
implies that the weight tensor VV can indeed be approximated by a tensor of rank lower than 25. 



(a) Explained variance (the larger, the better) for different methods on 
the school dataset. 



| HoMP 
|HoRMP 
] HoOMP 
] Latent 
S) Scaled 
| Overlapped 
| Nonconvex 


400 600 800 1000 1200 1400 1600 1800 2000 

Training size 


(b) Test MSE for different methods on the restaurant dataset. 


(c) Test MSE on the restaurant data using bar plot. 


Fig. 4: MLMTL results on two real datasets with different methods: HoMP, HoRMP, HoOMP, Latent j6j, Scaled |6j, Overlapped 


and Nonconvex 


0 


2 We would like to thank the first author of J 5 J to provide us the cleaned dataset. 
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TABLE III: Efficiency comparison of different methods on school and restaurant datasets. The results are averaged over all sample sizes and all the instances. 


Dataset 

HoMP 

HoRMP 

HoOMP 

Latent [6) 

Scaled (6j 

Overlapped (5), j6j 

Nonconvex |5] 

School 

0.33 

0.35 

0.36 

4.92 

1.44 

8.55 

110.31 

Restaurant 

0.45 

0.46 

0.47 

1.35 

5.81 

8.37 

110.17 


C. Robust tensor completion 

The goal of this subsection is to examine the effectiveness of the HoMP-G strategy. A suitable setting is the robust tensor 
completion, with the loss function instantiated by the Cauchy loss t a {t) = cr 2 /21og(l + t 2 /cr 2 ) with er fixed to 0.08. The 
compared methods are HoRPCA 144 Alg. 2.2] and RPCA |45) . HoRPCA and RPCA are designed to solve convex optimization 
problems, which employ the loss to penalize the noise or outliers, where RPCA is focused on robust matrix completion, while 
HoRPCA is for robust tensor completion. The stopping criterion and other settings are the same as the previous experiments. 

1) Synthetic data: Third order tensors of size 100 x 100 x 100 are randomly generated, with CP-rank 10. Some entries are 
randomly missing, with missing ratio (MR) varies from 0.3 to 0.99. 10% of the entries are contaminated by outliers drawn 
form [—1,1], We compare with HoRPCA in this experiment, and report the results in Fig. [5] We can observe that when the 
MR value is less than 0.6, HoRPCA is better than our method; when the MR value increases, the performance of HoRPCA 
decreases rapidly, while our method is more stable. This observation is similar to that in the experiment of tensor completion. 



Fig. 


5: Robust tensor completion results on synthetic data (100 x 100 x 100, CP-rank = 10) in terms of relative error. The compared method is HoRPCA 

|44l. 


2 ) Yale face: We compare our method and RPCA on removing shadows and specularities from face images, as that done in 
63 - The shadows and specularities can be seen as noise or outliers |45). The selected dataset consists of 64 face images of 


size 192 x 168, giving a 192 x 168 x 64 tensor. Previously RPCA treats it as a 32256 x 64 matrix [45 |, while we directly treat it 
as a tensor. We consider two settings: there are no missing entries and there are 60% missing entries. The maximum iteration 
for both two methods is 200. Part of the results are shown in Fig. [6] In the figure, row (a) shows some original images; (b) 
presents images recovered by HoMP-G (HoMP for short in this paragraph) and (c) are those recovered by RPCA. We can 
observe that both of the two methods can remove the shadows, while it seems from the first column that HoMP performs 
better, as it can remove the lines. However, images recovered by HoMP are not as clear as those recovered by RPCA. This 
may be because that HoMP yields a tensor of CP-rank 200, say ]C 2 ™ o 'x, 0 y, : 0 z,, to approximate the original data tensor, 
which is equal to that for the j-th image, 1 < j < 64, it is approximated by a matrix of rank at most 200, a'ZijXi ®y*. 
Since these x., and y, are not orthogonal, they may not be principle components, and hence may lead to loss of information. 
Rows (d)-(f) show the images with 60% entries missing and those recovered by HoMP and RPCA, respectively. In this case 
HoMP still performs stable, while RPCA cannot successfully impute all the missing entries. Finally, HoMP only requires 
(192 +168 + 64) • 200 = 169600 size to store the recovered tensor, while RPCA has to use 2064384 size to store the recovered 
matrix. Totally speaking, our method has the following features; it can remove shadows and specularities, impute missing 
entires, as well as generate a set of common basis {x.;,yi} for all the images, which has lower storage requirement, and is as 
efficient as the previous HoMPs. 


D. Linear convergence 

Last, we examine the linear convergence of HoMPs on three experiments, as shown in Fig. [7] In the figures, the y -axes is 
the logarithm of the cost function at iteration k, \og(F(]'Y t ^ kl )), while a;-axes stands for iteration. Specifically, Fig. |7a] plots the 
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Fig. 6: Comparison of HoMP-G and RPCA on removing shadows and specularities from face images, (a): Some original images, (b) Images recovered by 
HoMP-G. (c) Images recovered by RPCA. (d) MR = 60%. (e) Recovered by HoMP-G. (f) Recovered by RPCA. 


curves of HoMP-LS (blue), HoRMP-LS (red) and HoOMP-LS (green) on tensor completion; Fig. [7b] plots those on MLMTL, 
while Fig. [7c]plots the curve of HoMP-G on robust tensor completion. From the figures, we can observe that the curves confirm 
the theoretical results derived in Sect. El 


VI. Conclusion 

In this paper, we proposed higher order matching pursuit for low rank tensor learning. Comparing with some state-of-the-art 
methods, HoMPs have three important features: low computational complexity, low storage requirement, and linear convergence. 
Furthermore, HoMP can also be applied to problems with a nonconvex cost function, sharing the same convergence rate as those 
with a convex cost function. Numerical experiments on synthetic as well as real datasets verify the efficiency and effectiveness 
of HoMPs. 
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logarithm of the cost function at iteration k, log#-axes:iteration. The plots confirms the theoretical results in Sect. IV 
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